75 research outputs found

    An analysis of local and global solutions to address Big Data imbalanced classification: a case study with SMOTE preprocessing

    Addressing the huge amount of data continuously generated is an important challenge in the Machine Learning field, and the need to adapt traditional techniques or create new ones is evident. To do so, distributed technologies must be used to deal with the significant scalability constraints of the Big Data context. In many Big Data classification applications, some classes are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise. Consequently, preprocessing techniques that balance the class distribution were developed. This can be achieved by suppressing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood size and, especially, the type of distributed design (local vs. global).
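The SMOTE procedure described above can be sketched as follows. This is a minimal, single-machine illustration of the core idea — interpolating between a minority instance and one of its k nearest neighbors — not the local/global distributed designs the study compares; the function name and parameters are assumptions for illustration:

```python
import random

def smote(minority, k=5, oversampling=100, seed=0):
    """Minimal SMOTE sketch: create synthetic minority samples by
    interpolating between an instance and one of its k nearest
    neighbors. `oversampling` is the percentage of synthetic samples
    relative to the minority class size."""
    rng = random.Random(seed)
    n_new = len(minority) * oversampling // 100
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbors of x by squared Euclidean distance,
        # excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment x -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority instances, the quality of the result depends on which neighbors are reachable — exactly the property that differs between a local design (neighbors searched within each data partition) and a global one (neighbors searched over the whole minority class).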

    An insight into imbalanced Big Data classification: outcomes and challenges

    Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the high advantages related to knowledge extraction from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way, suited to commodity hardware. Being still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated during the data partitioning required to fit the MapReduce programming style. This paper is designed under three main pillars: first, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area; second, to analyze the behavior of standard pre-processing techniques in this particular framework; finally, taking into account the experimental results obtained throughout this work, to carry out a discussion on the challenges and future directions for the topic.

    This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
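The "divide-and-conquer" partitioning that MapReduce performs, and why it accentuates the lack-of-data problem for minority classes, can be sketched with a toy map/reduce pass over an imbalanced dataset. The partitioning and labels below are hypothetical, purely for illustration:

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    # local ("map") step: count class labels inside one data split
    return Counter(label for _, label in partition)

def reduce_phase(partial_counts):
    # "reduce" step: merge the partial counts from every split
    return reduce(lambda a, b: a + b, partial_counts, Counter())

# An imbalanced dataset split across three partitions, as a MapReduce
# runtime would distribute it over commodity nodes.
partitions = [
    [((0.1,), "maj"), ((0.2,), "maj"), ((0.3,), "min")],
    [((0.4,), "maj"), ((0.5,), "maj")],  # this split holds NO minority data
    [((0.6,), "maj"), ((0.7,), "maj"), ((0.8,), "min")],
]
total = reduce_phase(map_phase(p) for p in partitions)
```

Note that the second partition contains no minority instances at all: any per-partition learner or pre-processing step running on that split sees an even more extreme (in fact, degenerate) imbalance than the global distribution, which is the "accentuated during data partitioning" effect the abstract refers to.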

    Mining DEOPS records: big data's insights into dictatorship

    Historical data provide valuable information for the understanding of human interactions through time. However, mining these data is challenging, as the available records are generally noisy digitized handwritten, typewritten, or press-printed documents. In this research proposal, we plan to develop tools and techniques for pre-processing and extracting information from documents of the military dictatorship that ruled Brazil from 1964 to 1985. The data to be analyzed consist of digitized images of records from DEOPS/SP (São Paulo State Department of Political and Social Order), an emblematic police agency which monitored (and in some cases harassed and tortured) hundreds of thousands of Brazilian citizens during that period. The idea is to use state-of-the-art artificial intelligence algorithms in conjunction with crowdsourcing techniques to pre-process and extract information from this important period of Brazilian history.

    F-Measure Curves for Visualizing Classifier Performance with Imbalanced Data

    Training classifiers using imbalanced data is a challenging problem in many real-world recognition applications, due in part to the bias in performance that occurs when: (1) classifiers are optimized and compared using performance measurements that are unsuitable for imbalance problems; (2) classifiers are trained and tested on a fixed imbalance level of data, which may differ from operational scenarios; (3) the preference for correct classification of each class is application dependent. Specialized performance evaluation metrics and tools are needed for problems that involve class imbalance, including scalar metrics that assume a given operating condition (skew level and relative preference of classes), and global evaluation curves or metrics that consider a range of operating conditions. We propose a global evaluation space for the scalar F-measure metric that is analogous to the cost curves for expected cost. In this space, a classifier is represented as a curve that shows its performance over all of its decision thresholds and a range of imbalance levels, for the desired preference of true positive rate to precision. Experiments with synthetic data show the benefits of evaluating and comparing classifiers under different operating conditions in the proposed F-measure space over ROC, precision-recall, and cost spaces.
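The building block of such a space — the scalar F-measure swept over all decision thresholds of one classifier — can be sketched as below. The weighted-harmonic-mean form with a preference parameter `alpha` is a common formulation (alpha = 0.5 recovers the usual F1); the exact parameterization of the paper's F-measure space is not reproduced here:

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """Weighted harmonic mean of precision and recall; `alpha` encodes
    the relative preference of precision vs. recall (0.5 gives F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

def f_curve(scores, labels, alpha=0.5):
    """Sweep every decision threshold of one classifier and return
    (threshold, F) pairs — one classifier traced as a curve."""
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        points.append((t, f_measure(tp, fp, fn, alpha)))
    return points
```

Repeating the sweep while varying the class skew of the test data (and the preference `alpha`) yields the family of curves that makes up the proposed evaluation space, in the same way cost curves sweep operating conditions for expected cost.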